National Repository of Grey Literature 21 records found  1 - 10nextend  jump to record: Search took 0.01 seconds. 
Preprocessing and Transformation of Text Data Collections
Maruna, Viktor ; Burget, Radek (referee) ; Bartík, Vladimír (advisor)
This bachelor thesis deals with the issue of text-mining, mostly focused on preprocessing and transformation. In theoretical part there are contained information about development and principles of text-mining processes, text data collections and use in practice. The next part of this thesis describes in detail single steps of preprocessing and transformation of text data collections. In the final parts there are reviews of application development, testing and personal view on this thesis.
Stemming Methods Used in Text Mining
Adámek, Tomáš ; Chmelař, Petr (referee) ; Bartík, Vladimír (advisor)
The main theme of this master's thesis is a description of text mining. This document is specialized to English texts and their automatic data preprocessing. The main part of this thesis analyses various stemming algorithms (Lovins, Porter and Paice/Husk). Stemming is a procedure for automatic conflating semantically related terms together via the use of rule sets. Next part of this thesis describes design of an application for various types of stemming algorithms. Application is based on the Java platform with using of graphic library Swing and MVC architecture. Next chapter contains description of implementation of the application and stemming algorithms. In the last part of this master's thesis experiments with stemming algorithms and comparing the algorithm from viewpoint to the results of classification the text are described.
Determination of basic form of words
Šanda, Pavel ; Burget, Radim (referee) ; Karásek, Jan (advisor)
Lemmatization is an important preprocessing step for many applications of text mining. Lemmatization process is similar to the stemming process, with the difference that determines not only the word stem, but it´s trying to determines the basic form of the word using the methods Brute Force and Suffix Stripping. The main aim of this paper is to present methods for algorithmic improvements Czech lemmatization. The created training set of data are content of this paper and can be freely used for student and academic works dealing with similar problematics.
Methods for Mining Association Rules from Data
Uhlíř, Martin ; Burget, Radek (referee) ; Bartík, Vladimír (advisor)
The aim of this thesis is to implement Multipass-Apriori method for mining association rules from text data. After the introduction to the field of knowledge discovery, the specific aspects of text mining are mentioned. In the mining process, preprocessing is a very important problem, use of stemming and stop words dictionary is necessary in this case. Next part of thesis deals with meaning, usage and generating of association rules. The main part is focused on the description of Multipass-Apriori method, which was implemented. On the ground of executed tests the most optimal way of dividing partitions was set and also the best way of sorting the itemsets. As a part of testing, Multipass-Apriori method was compared with Apriori method.
Stemming of Czech Words
Hellebrand, David ; Bartík, Vladimír (referee) ; Chmelař, Petr (advisor)
The goal of this master's thesis is to develop stemming algorithm for czech language based on grammatical rules. You can find a description of stemming process and a comparsion of stemming algorithms in this project. The basics of czech grammar and Snowball language are also described here. The main part of this thesis concerns the implementation of the new czech stemming algorithm.
Derivation of Dictionary for Process Inspector Tool on SharePoint Platform
Pavlín, Václav ; Masařík, Karel (referee) ; Kreslíková, Jitka (advisor)
This master's thesis presents methods for mining important pieces of information from text. It analyses the problem of terms extraction from large document collection and describes the implementation using C# language and Microsoft SQL Server. The system uses stemming and a number of statistical methods for term extraction. This project also compares used methods and suggests the process of the dictionary derivation.
Slovak Lemmatization
Lipták, Šimon ; Dytrych, Jaroslav (referee) ; Smrž, Pavel (advisor)
Aim of this bachelor thesis was to become familiar with the tools and methods for morphological analysis and lemmatization of words, to design and to implement a system for lemmatization of slovak words, which are not in dictionary and then to write their forms, to process slovak data for implementation of stemming. At the end to score prediction based on testing and to compare with available alternatives.
Searching relevant articles in extensive collections
Vojt, Ján ; Novák, Jiří (advisor) ; Bartoš, Tomáš (referee)
Searching text in articles is usually implemented with fulltext search. Using more advanced techniques however, it is possible to achieve significantly better results. The subject of this work is to create a universal library for searching extensible collections, specialized in czech language. The library makes use of tools capable of working with morphology while considering importance of words. It also conducts an experiment with word pairs, which adds context into the search process. The success rate of this experiment is tried on an extensible collection of data. Created library is a unique tool for processing extensible collections of czech text, while at the same time it is ready for further extension by new languages and methods.
Morphological segmentation of Czech Words
Vidra, Jonáš ; Žabokrtský, Zdeněk (advisor) ; Mareček, David (referee)
In linguistics, words are usually considered to be composed of morphemes: units that carry meaning and are not further subdivisible. The task of this thesis is to create an automatic method for segmenting Czech words into morphemes, usable within the network of Czech derivational relations DeriNet. We created two different methods. The first one finds morpheme boundaries by differentiating words against their derivational parents, and transitively against their whole derivational family. It explicitly models morphophonological alternations and finds the best boundaries using maximum likelihood estimation. At worst, the results are slightly worse than the state of the art method Morfessor FlatCat, and they are significantly better in some settings. The second method is a neural network made to jointly predict segmentation and derivational parents, trained using the output of the first method and the derivational pairs from DeriNet. Our hypothesis that such joint training would increase the quality of the segmentation over training purely on the segmentation task seems to hold in some cases, but not in other. The neural model performs worse than the first one, possibly due to being trained on data which already contains some errors, multiplying them.
Searching relevant articles in extensive collections
Vojt, Ján ; Novák, Jiří (advisor) ; Bartoš, Tomáš (referee)
Searching text in articles is usually implemented with fulltext search. Using more advanced techniques however, it is possible to achieve significantly better results. The subject of this work is to create a universal library for searching extensible collections, specialized in czech language. The library makes use of tools capable of working with morphology while considering importance of words. It also conducts an experiment with word pairs, which adds context into the search process. The success rate of this experiment is tried on an extensible collection of data. Created library is a unique tool for processing extensible collections of czech text, while at the same time it is ready for further extension by new languages and methods.

National Repository of Grey Literature : 21 records found   1 - 10nextend  jump to record:
Interested in being notified about new results for this query?
Subscribe to the RSS feed.